Replace the FOG IPC Backend with one that doesn't use PContent or the main thread
Categories
(Toolkit :: Telemetry, task, P4)
People
(Reporter: chutten, Unassigned)
References
(Blocks 1 open bug)
Details
(Whiteboard: [telemetry:fog:m8])
In bug 1635255 we've been given the okay to build FOG's IPC atop PContent on the main thread while there are few users with little data. We don't need to be on the main thread. We don't want to be tied to only content processes. And the main thread doesn't want us either: it has more important things to do.
And more users with more data are coming.
This bug is for replacing the IPC backend with something better. First and foremost, get in contact with the IPC peers and see what the current state of the art is. Then propose a design for the replacement backend. Then implement it.
Reporter | Comment 1 • 4 years ago
FOG has now figured out what IPC is gonna look like. Time to see if IPC has any neat ideas about how we could get FOG's IPC off of the main thread and into arbitrary process types. :jld, :nika, do you happen to remember when I asked last year about nifty ways to send IPC opportunistically from and to background threads?
A recap:
- FOG (Firefox on Glean) is the layer that sits atop Glean in Firefox Desktop to replace Firefox Desktop Telemetry as the data collection mechanism
- It provides (amongst other things) IPC support so, e.g., the JS GC can record samples of how long each phase takes to a `timing_distribution`.
- Both sides of the communication are Rust.
- As such, the at-the-time-recommended approach was to use `serde` to bincode-serialize a payload to bytes, send it across FFI as a `ByteBuf`, send it across IPC in C++, then send it back down FFI to Rust, then bincode-deserialize it in the parent process. All this (and other misc signalling) can be found in `FOGIPC.cpp` as well as in ContentParent/Child and friends.
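For illustration, a minimal sketch of that batch-and-send shape on the Rust side, assuming a hypothetical `FogPayload` struct and only the `serde`/`bincode` (1.x) crates; the real payload types live in FOGIPC and the FOG crates and look different:

```rust
// Minimal sketch of the current batch-and-send approach. `FogPayload` is a
// hypothetical stand-in for FOG's real IPC payload type; only `serde` (with
// derive) and `bincode` 1.x are assumed here.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct FogPayload {
    // e.g. accumulated timing_distribution samples keyed by metric id
    timing_samples: Vec<(u32, Vec<u64>)>,
}

// Child side: serialize the batch to bytes, which get handed across FFI as a
// ByteBuf and shipped over IPC by C++ (FOGIPC.cpp, ContentParent/Child).
fn serialize_payload(payload: &FogPayload) -> Vec<u8> {
    bincode::serialize(payload).expect("payload should serialize")
}

// Parent side: after the bytes come back down FFI, deserialize and hand the
// samples to Glean.
fn deserialize_payload(bytes: &[u8]) -> FogPayload {
    bincode::deserialize(bytes).expect("payload should deserialize")
}
```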
Reporter | Comment 2 • 3 years ago
This rebuild should also take into account that current at-clean-process-shutdown IPC flushes (like this one for GMP) do not work, because by the time the child side is being destroyed the channel is already gone.
We'll want to tackle that anyway because we can't guarantee clean process shutdowns (especially on mobile), but especially because the existing solutions we thought we had aren't working out.
Reporter | Comment 3 • 3 years ago
We may wish to change paradigms from "batch-and-send" to "stream" via e.g. `DataPipe`. Note that using this from Rust ergonomically will involve getting good interfaces for `nsI{Input|Output}Stream`, a prototype of which Nika attached to bug 1782237.
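To illustrate the paradigm shift only, here's a sketch of streaming individual records on the Rust side, using `std::io::{Read, Write}` as a stand-in for `DataPipe` and the prototype stream bindings (neither of which is assumed here):

```rust
// Sketch of a streamed record format, for illustration of the paradigm only.
// std::io traits stand in for DataPipe / the prototype nsI*Stream bindings.
use std::io::{self, Read, Write};

// Child side: instead of batching, write one fixed-size (metric_id, sample)
// record into the pipe as the sample is recorded.
fn stream_sample(pipe: &mut impl Write, metric_id: u32, sample: u64) -> io::Result<()> {
    pipe.write_all(&metric_id.to_le_bytes())?;
    pipe.write_all(&sample.to_le_bytes())
}

// Parent side: a consumer thread drains records and hands them to Glean.
fn drain_samples(pipe: &mut impl Read, mut accumulate: impl FnMut(u32, u64)) -> io::Result<()> {
    let mut record = [0u8; 12];
    // read_exact fails at end-of-stream, which ends the drain loop.
    while pipe.read_exact(&mut record).is_ok() {
        let metric_id = u32::from_le_bytes(record[0..4].try_into().unwrap());
        let sample = u64::from_le_bytes(record[4..12].try_into().unwrap());
        accumulate(metric_id, sample);
    }
    Ok(())
}
```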
This will have ramifications for the CPU, thread, power, etc. instrumentation that currently relies on occasionally being triggered by FOGIPC batches. I warned the instrumentation owner (:florian) when it went in that this was coming at some point in the future, so this shouldn't be a surprise. But we'll want to give a lot of notice.
Comment 4 • 1 year ago
I don't think there's more context to add beyond what :chutten and I discussed asynchronously a few years ago, which is summarized in the above comments. If there are new questions for me, feel free to add a new ni?
Reporter | Comment 5 • 11 months ago
I've spent some time with `DataPipe` and I'm not entirely sure that it'll suit our purposes. Synchronizing production with consumption, operation by operation, probably raises the complexity budget too high. By which I mean: the idea of using the `DataPipe` to send `(metric_id, sample)` pairs which are picked up by some thread on the parent side and handed to Glean has some problems with:
- ensuring that production and consumption don't drift too far from one another in rate,
- bearing the runtime CPU costs of thread dispatch and contention, and
- figuring out the correct ring buffer size for metrics with different needs (samples on every vsync vs. events on every user interaction are orders of magnitude apart in frequency and sample size).
I think the "ideal" form might be a piece of shmem on each parent metric instance that acts as process-aware representation of the metric's storage that can be sync'd down into Glean on the glean.dispatcher
thread as normal. Certainly that most closely mimics what perf was talking about wanting for low-CPU telemetry accumulation (a little math and some sums (plus gaining an unlikely-to-be-locked write lock)). I don't know if that fits any existing IPC mechanisms or patterns, though.
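To make that concrete (purely illustrative; no actual Gecko shmem or Glean API is assumed), a per-metric shared-memory slot for a counter-ish metric could be little more than a couple of atomics the child bumps and the parent drains on the `glean.dispatcher` thread:

```rust
// Purely illustrative sketch of the "shmem per parent metric instance" idea.
// `SharedSlot` would live inside a shared-memory mapping; only std atomics
// are assumed here, no real Gecko shmem or Glean storage API.
use std::sync::atomic::{AtomicU64, Ordering};

#[repr(C)]
struct SharedSlot {
    // e.g. a timing_distribution reduced to sum + count for illustration
    sum: AtomicU64,
    count: AtomicU64,
}

impl SharedSlot {
    // Child side: recording is a couple of relaxed atomic adds; no IPC, no
    // thread dispatch, no lock in the common case.
    fn accumulate(&self, sample: u64) {
        self.sum.fetch_add(sample, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    // Parent side: the glean.dispatcher thread periodically drains the slot
    // and hands the totals to Glean's real storage.
    fn drain(&self) -> (u64, u64) {
        (
            self.sum.swap(0, Ordering::Relaxed),
            self.count.swap(0, Ordering::Relaxed),
        )
    }
}
```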
Nika, should I schedule a chat for us to noodle about this? Are there docs of IPC patterns I should read? Am I off-base, and should I take a second look at `DataPipe` since, even with the complexities I identified, it'll still be the least painful?
Comment 6 • 11 months ago
We can definitely set up some time to chat about telemetry-related things for accumulation etc. I'll need a bit of a better idea of what the shape of the data needs to look like for this shared memory region, etc.
I believe that while there are some Glean telemetry types which are quite simple (i.e. increment-a-global-counter style), there are also some telemetry types which are quite complex (like events). You might want distinct systems for complex objects like events vs. counters.
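A hypothetical illustration of that split (none of these types exist in Glean or FOG): fixed-size, atomically-updatable records suit counter-style metrics and a shared-memory slot, while variable-size payloads like events still want a serialize-and-send path of some kind:

```rust
// Hypothetical illustration of "distinct systems" for simple vs. complex
// metric types; no Glean or FOG type is assumed here.

/// Counter-style accumulations: fixed-size, so they can live in a
/// shared-memory slot and be applied with plain atomic adds.
enum SimpleAccumulation {
    CounterAdd { metric_id: u32, amount: u64 },
    TimingSample { metric_id: u32, nanos: u64 },
}

/// Event-style records: variable-size (timestamp plus arbitrary extras), so
/// they still need to be serialized and shipped over a channel of some kind.
struct EventRecord {
    metric_id: u32,
    timestamp_ms: u64,
    extras: Vec<(String, String)>,
}
```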